Testing for GxE interaction
in structured populations

Andrey Ziyatdinov

May 5, 2017

Definitions of GxE interaction

Biological interaction: Genetic factor(s) and environmental factor(s) participate in the same causal mechanism (Rothman et al., 2008)


Statistical interaction using linear regression (unrelated individuals):

\(y = \mu + \beta_g x_g + \beta_e x_e + \beta_{int} x_g \times x_e + e\)
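As an illustration of the regression model above, here is a minimal Python sketch that fits the four coefficients by ordinary least squares. The function name `fit_interaction`, the simulated effect sizes, and the allele and exposure frequencies are assumptions made for this example only.

```python
import numpy as np

def fit_interaction(y, x_g, x_e):
    """OLS fit of y = mu + b_g*x_g + b_e*x_e + b_int*x_g*x_e + e."""
    X = np.column_stack([np.ones_like(y), x_g, x_e, x_g * x_e])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)           # covariance of beta-hat
    return beta, cov

# Hypothetical data: unrelated individuals, additive SNP coding, binary exposure.
rng = np.random.default_rng(1)
n = 20000
x_g = rng.binomial(2, 0.3, size=n).astype(float)
x_e = rng.binomial(1, 0.5, size=n).astype(float)
y = 0.1 * x_g + 0.2 * x_e + 0.3 * x_g * x_e + rng.normal(size=n)

beta, cov = fit_interaction(y, x_g, x_e)
z_int = beta[3] / np.sqrt(cov[3, 3])  # Wald statistic for the interaction term
print(beta, z_int)
```

The interaction test is the last coefficient divided by its standard error. This sketch is only valid for unrelated individuals.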

Current data sets and methods in GWAS of GxE

| Consortium | Sample size | Exposure | Outcome | Reference |
|---|---|---|---|---|
| CHARGE + SpiroMeta | 50,047 | Smoking | Pulmonary function | (Hancock et al., 2012) |
| SUNLIGHT | 35,000 | Vitamin D intake | Circulating vitamin D level | (Wang et al., 2010) |
| GIANT | up to 339,224 | Gender | Anthropometric traits | (Heid et al., 2010) |
| … | | | | |


  • Most studies used a full framework (model on the previous slide)
    • GxE analysis on the full sample
  • Other studies used a stratified framework (coming on the next slides)
    • GxE analysis on sub-samples stratified by exposure

GxE full framework in GWAS

Example: G x smoking in pulmonary function outcomes (Hancock et al., 2012)

  • 50,047 participants from 19 studies; ~2.5M SNPs
  • outcomes: FEV1, FEV1/FVC (%)
  • smoking variables: ever-smoker, current-smoker, pack-years
    • marginal terms: all included
    • interaction terms: tested separately
  • joint test: \(\beta_g = 0\) and \(\beta_{int} = 0\) under the null (Aschard et al., 2011)
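The joint test in the last bullet can be sketched as a 2-degree-of-freedom Wald test on \(\beta_g\) and \(\beta_{int}\). The Wald form, the function name `joint_wald_2df`, and the toy numbers are assumptions for illustration; the cited paper may use a different but asymptotically equivalent statistic.

```python
import numpy as np

def joint_wald_2df(beta, cov, idx=(1, 3)):
    """Joint Wald test that the coefficients at positions idx are both zero.

    beta: estimated coefficients; cov: their covariance matrix.
    idx defaults to (genetic effect, interaction effect) in the order
    (mu, beta_g, beta_e, beta_int), which is an assumption of this sketch.
    """
    idx = list(idx)
    b = np.asarray(beta, dtype=float)[idx]
    C = np.asarray(cov, dtype=float)[np.ix_(idx, idx)]
    stat = float(b @ np.linalg.solve(C, b))
    # The survival function of a chi-square with 2 df is exp(-x / 2).
    p_value = float(np.exp(-stat / 2.0))
    return stat, p_value

# Toy numbers (hypothetical): uncorrelated estimates with variance 0.01 each.
beta = [0.0, 0.2, 0.1, 0.3]
cov = np.diag([0.01, 0.01, 0.01, 0.01])
stat, p_value = joint_wald_2df(beta, cov)
print(stat, p_value)
```

With real estimates the off-diagonal covariance between \(\hat{\beta}_g\) and \(\hat{\beta}_{int}\) is generally not zero, which is why the full 2x2 block is inverted rather than summing two squared Z-scores.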


Findings: three novel gene regions

  1. DNER
  2. HLA-DQB1 and HLA-DQA2
  3. KCNJ2 and SOX9

GxE stratified framework in GWAS

Example: G x gender in the Genetic Investigation of Anthropometric Traits (GIANT) consortium

  • outcome: Waist-hip ratio (WHR)
  • gender as an exposure
    • 108,979 women
    • 82,483 men
  • marginal GWAS: 14 loci associated with WHR
  • explained variance in WHR by 14 loci
    • 1.34% in women
    • 0.46% in men

Findings:

  • stratified interaction analysis:
    • 7/14 loci showed sex-specificity
    • 2/14 genome-wide significant

Other negative results were not published…

because of the challenges inherent in the detection of GxE

GxE interactions: interesting to know, difficult to detect

Statistical power for interaction tests is much lower than for similar tests of marginal genetic effects (Murcray et al., 2011)

It also faces other potential issues (Aschard et al., 2012):

  • Confounding
  • Exposure measurement error and misclassification
  • Dynamics of gene–environment interactions
  • …

Relatedness adds yet another layer of complexity to GxE analysis,
and its impact on the full/stratified GxE frameworks is seldom explored.

Structure among individuals considered

  • shared environment: household groups
  • recent genetic relatedness: family members
  • distant genetic relatedness: admixed populations

Our goal

Assess the relative performance of GxE methods
in the presence of structure

Outline

  1. GxE full framework in structured population
  2. GxE stratified framework in structured population
  3. Ancestry x E analysis in admixed population

Part 1: GxE full framework
in structured population

Population structure in (marginal) association tests

Methods to account for relatedness are relatively well established
in marginal association studies (GWAS)

  • Principal component analysis (PCA)
  • Linear mixed models (LMMs) (Yang et al., 2014)
  • Robust tests: genotype-conditional association test (GCAT) (Song et al., 2015, NatGen)

Linear mixed model
(also our data simulation model)

\(y = X \beta + g + f + e\)

\(\mbox{where } g \perp f \perp e\)

  • \(g \sim (0, \sigma_g^2 K)\), the (additive) genetic effect
  • \(f \sim (0, \sigma_f^2 F)\), the household/family effect
  • \(e \sim (0, \sigma_r^2 I)\), the residual error

\(\mbox{implying}\)

\(y \sim (X \beta, \sigma_g^2 K + \sigma_f^2 F + \sigma_r^2 I) = (X \beta, V)\)

  1. marginal effect:

\(X \beta = \mu + \beta_g x_g\)


  2. interaction effect:

\(X \beta = \mu + \beta_g x_g + \beta_e x_e + \beta_{int} x_{ge}\)
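To make the simulation model concrete, the following Python sketch draws the random part \(g + f + e\) of the outcome for nuclear families. It assumes families of two unrelated parents plus three children, a block-diagonal double kinship matrix \(K\), and an all-ones household block for \(F\). These choices and the function names are illustrative and not taken from the slides; the fixed effects \(X \beta\) would be added to the returned vector.

```python
import numpy as np

def nuclear_family_kinship(n_children=3):
    """Double kinship matrix for two unrelated parents and their children."""
    n = 2 + n_children
    K = np.full((n, n), 0.5)     # parent-child and sib-sib entries
    K[0, 1] = K[1, 0] = 0.0      # the two parents are unrelated
    np.fill_diagonal(K, 1.0)
    return K

def simulate_lmm(n_fam, s2_g, s2_f, s2_r, n_children=3, seed=0):
    """Draw y ~ (0, s2_g*K + s2_f*F + s2_r*I) under a normal model."""
    rng = np.random.default_rng(seed)
    n = 2 + n_children
    K = np.kron(np.eye(n_fam), nuclear_family_kinship(n_children))
    F = np.kron(np.eye(n_fam), np.ones((n, n)))   # shared household effect
    V = s2_g * K + s2_f * F + s2_r * np.eye(n_fam * n)
    L = np.linalg.cholesky(V)                     # V = L L^T
    y = L @ rng.normal(size=n_fam * n)
    return y, K, F, V

# Hypothetical variance proportions summing to 1.
y, K, F, V = simulate_lmm(n_fam=500, s2_g=0.4, s2_f=0.2, s2_r=0.4)
print(y.shape, V[:5, :5])
```

Normality is an extra assumption of the sketch; the model on the slide only specifies the first two moments.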

Estimation of model parameters

  1. Estimate variance components by ML/REML

\(\hat{V} = \hat{\sigma_g^2} K + \hat{\sigma_f^2} F + \hat{\sigma_r^2} I\)

  2. Derive the effect size as in Generalized Least Squares (GLS)

\(\hat{\beta} = (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} y\)

\(var(\hat{\beta}) = (X^T \hat{V}^{-1} X)^{-1}\)
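The GLS step can be sketched in a few lines of Python. This is a minimal illustration that takes \(\hat{V}\) as given (the ML/REML estimation of the variance components is not shown), and the function name `gls` is an assumption of this example.

```python
import numpy as np

def gls(X, y, V):
    """Generalized least squares for a known (or estimated) covariance V."""
    Vinv_X = np.linalg.solve(V, X)        # V^{-1} X without forming V^{-1}
    cov = np.linalg.inv(X.T @ Vinv_X)     # (X^T V^{-1} X)^{-1}
    beta = cov @ (Vinv_X.T @ y)           # uses the symmetry of V
    return beta, cov

# Sanity check on hypothetical data: with V = I, GLS reduces to OLS.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
beta_gls, cov_gls = gls(X, y, np.eye(50))
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_gls, beta_ols)
```

Solving linear systems with `V` instead of inverting it explicitly is a numerical convenience, not part of the method itself.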

From matrix to vector forms
(needed for the power derivation)

Simplify to a one-covariate model by orthogonalization

  1. marginal effect: \(E(Y) = \mu + \beta_g x_g\)

\(y^*\), centered \(y\)
\(x^*_g\), centered \(x_g\)

\(var(\hat{\beta}_g) = ({x^*_g}^T \hat{V}^{-1} x^*_g)^{-1}\)


  2. interaction effect: \(E(Y) = \mu + \beta_g x_g + \beta_e x_e + \beta_{int} x_{ge}\)

\(y^*\), centered \(y\)
\(x^*_{ge}\), centered \((x_g - \mu_g) (x_e - \mu_e)\)

\(var(\hat{\beta}_{int}) = ({x^*_{ge}}^T \hat{V}^{-1} x^*_{ge})^{-1}\)

Power

The power as a function of the non-centrality parameter (NCP)

\(NCP \approx \beta^2 tr(\hat{V}^{-1} \Sigma_x)\)

Data Distribution
outcome \(y \sim (X \beta, V) = (X \beta, \sigma_f^2 F + \sigma_g^2 K + \sigma_r^2 I)\)
predictor \(x \sim (\mu_x, \Sigma_x)\)

Type I error rate (before going to power)

In related individuals

  • Linear mixed models are well calibrated
  • Linear models have an inflated Type I error

Data simulation 1 (marginal): genetic relatedness

Data simulation of the whole sample (nuclear families):

  • \(y \sim (X \beta, V) = (\mu + \beta_g x_g, \sigma_g^2 K + \sigma_r^2 I)\)
  • \(\sigma_g^2 + \sigma_r^2 = 1\)
| Structure | \(\Sigma_y = V\) | \(\Sigma_x\) | \(NCP \approx \beta^2 tr(\hat{V}^{-1} \Sigma_x)\) |
|---|---|---|---|
| unrelated | \((\sigma_g^2 + \sigma_r^2) I\) | \(\sigma_x I\) | \(\beta^2 \, 2pq \, N\) |
| genetically related | \(\sigma_g^2 K + \sigma_r^2 I\) | \(\sigma_x K\) | \(\beta^2 \, 2pq \, tr((\hat{\sigma}_g^2 K + \hat{\sigma}_r^2 I)^{-1} K)\) |
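The two NCP expressions can be compared numerically. The Python sketch below assumes nuclear families of two unrelated parents and three children, a block-diagonal \(K\) (so the trace is the per-family trace times the number of families), and hypothetical values for the effect size and allele frequency. Function names are made up for this example.

```python
import numpy as np

def nuclear_family_kinship(n_children=3):
    """Double kinship matrix for two unrelated parents and their children."""
    n = 2 + n_children
    K = np.full((n, n), 0.5)
    K[0, 1] = K[1, 0] = 0.0
    np.fill_diagonal(K, 1.0)
    return K

def ncp_unrelated(beta, maf, n_ind):
    """NCP = beta^2 * 2pq * N for unrelated individuals."""
    return beta**2 * 2 * maf * (1 - maf) * n_ind

def ncp_related(beta, maf, h2, K_fam, n_fam):
    """NCP = beta^2 * 2pq * tr((h2*K + (1-h2)*I)^{-1} K), block-diagonal K."""
    n = K_fam.shape[0]
    V_fam = h2 * K_fam + (1 - h2) * np.eye(n)
    tr_fam = np.trace(np.linalg.solve(V_fam, K_fam))
    return beta**2 * 2 * maf * (1 - maf) * n_fam * tr_fam

K_fam = nuclear_family_kinship(3)
for h2 in (0.0, 0.25, 0.5, 0.75):
    print(h2, ncp_unrelated(0.1, 0.3, 2500),
          ncp_related(0.1, 0.3, h2, K_fam, 500))
```

Under these assumptions the two NCPs coincide at \(\sigma_g^2 = 0\), and the related-sample NCP is smaller for intermediate heritability.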

Analytical results 1 (marginal): genetic relatedness

Analytical results = Simulation results

  • Power in unrelated individuals is always higher
  • The difference in power depends on the heritability (\(\sigma_g^2\)) non-monotonically

Confirmed the known results

But our formula allows us to explore performance further across various study designs

Data simulation 2 (marginal): shared environment

Data simulation of the whole sample (nuclear families):

  • \(y \sim (X \beta, V) = (\mu + \beta_g x_g, \sigma_f^2 F + \sigma_r^2 I)\)
  • \(\sigma_f^2 + \sigma_r^2 = 1\)
| Structure | \(\Sigma_y = V\) | \(\Sigma_x\) | \(NCP \approx \beta^2 tr(\hat{V}^{-1} \Sigma_x)\) |
|---|---|---|---|
| unrelated | \((\sigma_f^2 + \sigma_r^2) I\) | \(\sigma_x I\) | \(\beta^2 \, 2pq \, N\) |
| shared environment | \(\sigma_f^2 F + \sigma_r^2 I\) | \(\sigma_x I\) | \(\beta^2 \, 2pq \, tr((\hat{\sigma}_f^2 F + \hat{\sigma}_r^2 I)^{-1})\) |
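
For intuition, the trace in the shared-environment row can be worked out in closed form under a simplifying assumption that is not stated on the slides: \(m\) households of equal size \(n\) (so \(N = mn\)), with \(F\) block diagonal and each block the all-ones matrix \(J_n\). Each block \(\sigma_f^2 J_n + \sigma_r^2 I_n\) has one eigenvalue \(\sigma_r^2 + n \sigma_f^2\) and \(n - 1\) eigenvalues \(\sigma_r^2\), so

\(tr((\sigma_f^2 F + \sigma_r^2 I)^{-1}) = m \left( \frac{1}{\sigma_r^2 + n \sigma_f^2} + \frac{n - 1}{\sigma_r^2} \right)\)

With \(\sigma_f^2 = 0\) and \(\sigma_r^2 = 1\) this reduces to \(mn = N\), the unrelated case. With \(\sigma_f^2 = \sigma_r^2 = 0.5\) and \(n = 5\) it gives \(m (1/3 + 8) \approx 8.33 m\), compared with \(5m\) for unrelated individuals.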

Analytical results 2 (marginal): shared environment

Analytical results = Simulation results

  • The power in individuals with shared environment increases as more variance is explained

Summary

Marginal analysis

  1. Study designs: unrelated \(\gtrapprox\) genetically related
  2. The power increases as more variance is explained
    • by taking into account shared environment

GxE interaction analysis

  1. Study designs: genetically related \(>\) unrelated


Ongoing work

  • Simulations on correlated exposure \(x_e\) and shared environment \(f\)
  • GxE interactions in random effects (strata-specific variance decomposition)
  • Different family designs: sib-pairs, trios

Part 2: GxE stratified framework
in structured population

GxE stratified framework

  1. Compute marginal genetic effects in strata, e.g., males and females
    • \(\beta_m\) and \(\beta_f\), the genetic effects
    • \(\sigma_{\beta_m}\) and \(\sigma_{\beta_f}\), their standard errors
  2. Combine stratified results and perform tests
    • strata-specific, interaction (differentiated), joint, heterogeneity


| Strata | Stratified interaction test | Reference |
|---|---|---|
| Independent | \(Z_{int} = \frac{\beta_m - \beta_f}{\sqrt{\sigma_{\beta_m}^2 + \sigma_{\beta_f}^2}} \sim \mathcal{N}(0, 1)\) | (Magi et al., 2010) |
| Related | \(Z_{int} = \frac{\beta_m - \beta_f}{\sqrt{\sigma_{\beta_m}^2 + \sigma_{\beta_f}^2 - 2 r \sigma_{\beta_m} \sigma_{\beta_f}}} \sim \mathcal{N}(0, 1)\) | (Randall et al., 2013) |

\(r\) is the Spearman rank correlation between the two strata-specific results, computed across SNPs

  • a naive approach that needs further investigation (Sofer et al., 2016, GenEpi)
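The stratified interaction test can be sketched in Python as follows. The denominator is the standard error of a difference of two correlated estimates, \(\sqrt{\sigma_{\beta_m}^2 + \sigma_{\beta_f}^2 - 2 r \sigma_{\beta_m} \sigma_{\beta_f}}\), which reduces to the independent-strata test when \(r = 0\). Function names and the effect sizes and standard errors in the example are hypothetical.

```python
import math

def z_interaction(beta_m, se_m, beta_f, se_f, r=0.0):
    """Z statistic for the difference of two strata-specific effects.

    r is the correlation between the two estimates; r = 0 gives the
    test for independent strata.
    """
    var_diff = se_m**2 + se_f**2 - 2.0 * r * se_m * se_f
    return (beta_m - beta_f) / math.sqrt(var_diff)

def two_sided_p(z):
    """Two-sided p-value under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical strata-specific estimates for one SNP.
z_indep = z_interaction(0.05, 0.01, 0.02, 0.01)
z_rel = z_interaction(0.05, 0.01, 0.02, 0.01, r=0.167)
print(z_indep, two_sided_p(z_indep))
print(z_rel, two_sided_p(z_rel))
```

A positive correlation between strata shrinks the variance of the difference, so ignoring it makes the test conservative in this toy setting.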

Simulation results 3: stratified \(\approx\) full

Data simulation of the whole sample (nuclear families, shared environment):

  • \(y \sim (X \beta, V) = (\mu + \beta_{int} x_g \times x_e, \sigma_g^2 K + \sigma_f^2 F + \sigma_r^2 I)\)
  • 2,500 individuals in 500 nuclear families
  • 20,000 SNPs under the null; 10,000 under the alternative
  • effect size 0.1; independent genetic and exposure variables

Output: \(\rho = 0.167\) between strata

Ongoing work

  • Derive analytical formulas for the GxE stratified framework
    • assess whether the Spearman correlation is robust enough
  • Explore more complex scenarios
    • outcomes in strata are genetically correlated
    • imbalance in sample size


Bear in mind the results from the LD score regression for two outcomes (Bulik-Sullivan et al., 2015)

\(E[Z_{1j} Z_{2j}] = \frac{\sqrt{N_1 N_2} {\rho}_g}{M} l_{j} + \frac{N_s \rho}{\sqrt{N_1 N_2}}\)
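A small Python sketch of the formula above, separating the slope term (driven by the genetic covariance \(\rho_g\) and the LD score \(l_j\)) from the intercept term (driven by the \(N_s\) overlapping samples and their phenotypic correlation \(\rho\)). The function name and the numbers in the example are hypothetical.

```python
import math

def expected_z_product(n1, n2, rho_g, m, ld_score, n_shared, rho):
    """Expected product of Z-scores for SNP j in two studies.

    n1, n2: sample sizes; rho_g: genetic covariance; m: number of SNPs;
    ld_score: LD score of SNP j; n_shared: overlapping samples;
    rho: phenotypic correlation among the overlapping samples.
    """
    slope = math.sqrt(n1 * n2) * rho_g / m
    intercept = n_shared * rho / math.sqrt(n1 * n2)
    return slope * ld_score + intercept

# With no genetic covariance, only the sample-overlap intercept remains.
only_overlap = expected_z_product(1000, 1000, 0.0, 1_000_000, 50.0, 1000, 0.2)
# With no overlap, only the genetic term remains.
only_genetic = expected_z_product(1000, 1000, 0.5, 1_000_000, 50.0, 0, 0.2)
print(only_overlap, only_genetic)
```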

Part 3: Ancestry x E analysis
in admixed population

COPDGene project

http://www.copdgene.org/

  • 10,000 ever-smokers
    • 3,300 admixed African-Americans
  • a rich set of COPD outcomes/exposures
  • SNPs, inferred local/global ancestry

Previous studies reported

  • Smoking is the major risk factor
  • African ancestry associated with increased risk of COPD

The project aims at leveraging the ancestry information in GxE tests

  • global ancestry \(\times\) exposure
  • local ancestry \(\times\) exposure
  • SNP \(\times\) exposure

Analytical plan

  • List of outcomes: FEV1, FVC, FEV1/FVC, …
  • List of exposures: ever-smoker, current-smoker, pack-years, gender, …

Planning

  • Initial plan: use global ancestry to capture structure (i.e., first PC)

First results

Simulation on the null outcome

Summary

Thank you

Extra slides

Possible study designs for comparison

| | Study design 1 | Study design 2 | Study design 3 |
|---|---|---|---|
| Sample | Family-based | Population-based | Population-based |
| Relationships | Kinship | | GRM |
| Method | Linear mixed models | Linear models | Linear mixed models |


GxE in study design 3 is our ongoing work (not presented today)


GxE in study designs 1 vs. 2 (today's focus)

  • Compare two study designs unrelated/related
  • LMM performed using our new lme4qtl R package (under submission)

Simulation study

Given: a population of 50,000 related samples (nuclear families)

Experiment: pool 5,000 unrelated samples or pool randomly

| Relatedness | \(V\) | \(\Sigma_x\) | Normalization |
|---|---|---|---|
| unrelated | \(\sigma_g^2 K + \sigma_r^2 I = (\sigma_g^2 + \sigma_r^2) I\) | \(\sigma_x I\) | \(\sigma_g^2 + \sigma_r^2 = 1\) |
| genetically related | \(\sigma_g^2 K + \sigma_r^2 I\) | \(\sigma_x K\) | \(\sigma_g^2 + \sigma_r^2 = 1\) |


  • \(K\), the double kinship matrix
  • \(\Sigma_x\), the variance-covariance matrix of predictor \(x\)
  • \(\sigma_g^2\), \(\sigma_r^2\), variance proportions

GAIT2 Spanish families (previous project)

The Genetic Analysis of Idiopathic Thrombophilia 2 (GAIT2) Project

  • Study of Venous Thrombosis
    • disease prevalence <1%
    • heritability \(\sim\) 60%
  • 935 individuals in 35 families (27 per family on average)
  • Hundreds of phenotypes (blood coagulation system)
  • Genotype and RNA-seq data

Developed tools for analysis of family-based samples

  • solarius R package [makes SOLAR easier to use]
  • lme4qtl R package [makes lme4 flexible]